Imperfect gold standard gene sets yield inaccurate evaluation of causal gene identification methods

Causal gene discovery methods are often evaluated using reference sets of causal genes, which are treated as gold standards (GS) for the purposes of evaluation. However, evaluation methods typically treat genes not in the GS positive set as known negatives rather than unknowns. This leads to inaccurate estimates of sensitivity, specificity, and AUC. Labeling biases in GS gene sets can also lead to inaccurate ordering of alternative causal gene discovery methods. We argue that the evaluation of causal gene discovery methods should rely on statistical techniques like those used for variant discovery rather than on comparison with GS gene sets.

When mislabeling is random, i.e., labeled positive (LP) genes are a random subset of all true positives, evaluation based on positive-unlabeled (PU) data will correctly estimate sensitivity but underestimate specificity (Fig. A.1a). However, if LP and unlabeled positive (UP) genes differ on important features, sensitivity and specificity may each be either over- or under-estimated (Appendix A). In genetics research, we expect labeling biases because there are multiple molecular mechanisms by which a causal gene can affect complex diseases, and different classification and GS identification methods will favor different mechanisms. Error in estimating sensitivity and specificity produces error in the ROC curve and therefore in the area under the ROC curve (AUC), so this problem affects the evaluation of ranking methods as well as methods that return only hard classifications.
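To make the random-mislabeling case concrete, here is a minimal simulation sketch (ours, not from the paper; all sample sizes, rates, and thresholds are illustrative) showing that PU-label evaluation recovers sensitivity but deflates specificity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 10,000 genes; 20% are truly causal (positive).
n = 10_000
y_true = rng.random(n) < 0.2

# A classifier score that is informative but imperfect.
score = rng.normal(loc=np.where(y_true, 1.0, 0.0), scale=1.0)
y_pred = score > 0.5  # hard classification at an arbitrary threshold

# PU labels: only a random 30% of true positives are labeled positive;
# the remaining positives are lumped in with the negatives ("unlabeled").
labeled_pos = y_true & (rng.random(n) < 0.3)

def sens_spec(pos_mask, pred):
    sens = pred[pos_mask].mean()
    spec = (~pred[~pos_mask]).mean()
    return sens, spec

sens_t, spec_t = sens_spec(y_true, y_pred)       # truth-based evaluation
sens_m, spec_m = sens_spec(labeled_pos, y_pred)  # PU-label-based evaluation

print(f"true     sens={sens_t:.3f} spec={spec_t:.3f}")
print(f"PU-label sens={sens_m:.3f} spec={spec_m:.3f}")
# Expected: sens_m ~= sens_t, but spec_m < spec_t, because unlabeled
# positives that the classifier correctly flags count as false positives.
```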
To illustrate this issue, we consider a hypothetical example in which causal genes fall into two functional categories: expression-acting (gene-to-trait effect mediated by RNA expression level) and protein-acting (gene-to-trait effect mediated by variation in protein products). Only protein-acting genes are labeled positive in the GS gene set, mimicking the fact that most LP genes in current practice are Mendelian genes. While LP genes are all protein-acting, UP genes are a mixture of protein-acting and expression-acting genes (Fig. 1a). We then consider two hypothetical PCG classifiers: the first, based on expression-related features (E-Classifier), has higher sensitivity to detect expression-acting causal genes; the second, based on protein-related features (P-Classifier), has higher sensitivity to detect protein-acting causal genes. Figure 1c shows that the estimated ROC curve for the E-Classifier lies below its true ROC curve, resulting in a downward bias in estimated AUC: the expression-acting causal genes that the E-Classifier correctly ranks highly are unlabeled and are therefore counted as false positives. In comparison, the evaluation of the P-Classifier is overly optimistic. In this particular setting, the E-Classifier has a better overall ability to identify PCGs than the P-Classifier but is evaluated as being much worse.
Due to the myriad biological mechanisms leading to complex phenotypes, it is currently impossible to confidently determine a comprehensive GS gene set that includes all causal genes for any trait. Several studies have acknowledged that supervised ML methods designed to classify PCGs should not be trained on incomplete GS gene sets [2], [5]. Here, we want to draw attention to the fact that sensitivity and specificity estimated using incomplete labels are also inaccurate, making it inappropriate to compare and evaluate methods using these measures, or using AUC estimated from GS gene sets. To address a similar issue in other fields, researchers have proposed incorporating negative controls (i.e., known negatives) or weights estimating each unit's probability of being detected based on its features [6], [1]. However, whether either approach is feasible for the PCG implication problem is unclear. An alternative approach that circumvents the issue is to use a statistical model-based approach for causal gene identification. Methods based on estimating parameters in probabilistic models provide model-based measures of uncertainty, such as posterior inclusion probabilities or confidence intervals, and they can be evaluated in simulations to test their sensitivity to violations of modeling assumptions. Such a paradigm is long-standing practice in genetics research for identifying causal variants. Finally, we re-emphasize that researchers should avoid drawing conclusions about the relative performance of different classifiers from performance measures estimated using PU-labeled gene sets, and should clearly acknowledge the limitations of any labeled gene set used in evaluation or training.
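To illustrate what a model-based measure of uncertainty looks like in the variant-discovery setting (a toy sketch, not a method from this paper): under the common single-causal-signal assumption, posterior inclusion probabilities (PIPs) can be computed from summary statistics using Wakefield's approximate Bayes factors. The function name and the prior variance below are our own illustrative choices.

```python
import numpy as np

def single_signal_pips(z, se, prior_var=0.04):
    """Posterior inclusion probabilities under a single-causal-variant model.

    z         : per-variant z-scores
    se        : standard errors of the effect estimates (V = se**2)
    prior_var : prior variance W of the true effect size (illustrative value)

    Uses Wakefield's approximate Bayes factor in favor of association,
        ABF_j = sqrt(V_j / (V_j + W)) * exp(z_j^2 * W / (2 * (V_j + W))),
    with a uniform prior over which single variant is causal, so that
        PIP_j = ABF_j / sum_k ABF_k.
    """
    z = np.asarray(z, dtype=float)
    V = np.asarray(se, dtype=float) ** 2
    W = prior_var
    # Work in log space for numerical stability with large z-scores.
    log_abf = 0.5 * np.log(V / (V + W)) + z**2 * W / (2 * (V + W))
    log_abf -= log_abf.max()
    abf = np.exp(log_abf)
    return abf / abf.sum()

# Example: one strong signal among five variants.
pips = single_signal_pips(z=[1.2, 0.5, 5.8, 0.9, 2.1], se=[0.1] * 5)
print(np.round(pips, 3))  # posterior mass concentrates on the third variant
```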

A Illustrating the Effect of Mislabeling on Classification Accuracy
Receiver operating characteristic (ROC) analysis is one of the most common diagnostic methods for binary classifiers and is constructed from estimates of sensitivity and specificity. In this section we demonstrate how mislabeled positive-negative (PN) data affect the observed sensitivity and specificity relative to perfectly labeled data.
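For reference, the standard definitions on which ROC analysis is built, in terms of true/false positive and negative counts (TP, FP, TN, FN):

$$\mathrm{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad \mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}, \qquad \mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}.$$

The ROC curve traces sensitivity against $1 - \mathrm{specificity}$ as the classification threshold varies, and the AUC is the area under this curve.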
A set of binary, labeled data is illustrated in Fig. A.1a, where a portion of the true positives are labeled as negative. Following the counts in Fig. 1b, let $a$ and $A$ denote the numbers of correctly labeled positives that the classifier calls positive and negative, respectively; $b$ and $B$ the corresponding counts among mislabeled positives; and $c$ and $C$ those among true negatives. By definition, the key accuracy measures can be calculated from these counts. We can easily establish that

$$\mathrm{PPV}_M = \frac{a}{a+b+c} \;\le\; \frac{a+b}{a+b+c} = \mathrm{PPV}_T$$

always holds. Here $\mathrm{PPV}_T$ denotes the positive predictive value of the classifier under the true data labels, whereas $\mathrm{PPV}_M$ denotes the PPV of the classifier under the incorrect labels. Additionally, if the mislabeling among true positives occurs at random, then we have

$$\frac{b}{B+b} = \frac{a}{A+a}.$$

Setting $\gamma = \frac{B+b}{A+a}$ as the ratio of mislabeled to correctly labeled positives, we have

$$\mathrm{Sens}_T = \frac{a+b}{(A+a)+(B+b)} = \frac{a+\gamma a}{(A+a)(1+\gamma)} = \frac{a}{A+a} = \mathrm{Sens}_M.$$

This implies that the sensitivity estimate is accurate when the labeling error is completely random. However, if the classifier has some discriminating ability among the mislabeled genes, then

$$\frac{b}{B+b} > \frac{c}{C+c}.$$

Setting $\alpha = \frac{B+b}{C+c}$ as the ratio of mislabeled positives to true negatives, we have

$$\mathrm{Spec}_M = \frac{B+C}{(B+b)+(C+c)} = \frac{\alpha\,\frac{B}{B+b} + \frac{C}{C+c}}{1+\alpha} \;<\; \frac{C}{C+c} = \mathrm{Spec}_T.$$

This indicates that specificity will be underestimated as long as the classifier has any discriminating ability among the mislabeled genes.
We further illustrate several other mislabeling scenarios. When the classifier performs worse on mislabeled genes than on correctly labeled true positives, but better than on correctly labeled true negatives, i.e., $\frac{c}{C+c} < \frac{b}{B+b} < \frac{a}{A+a}$ (Fig. A.1b), we would have

$$\mathrm{Sens}_M > \mathrm{Sens}_T \quad\text{and}\quad \mathrm{Spec}_M < \mathrm{Spec}_T.$$

If the classifier has the worst performance on the mislabeled genes ($\frac{b}{B+b} < \frac{c}{C+c}$, Fig. A.1c), we would expect

$$\mathrm{Sens}_M > \mathrm{Sens}_T \quad\text{and}\quad \mathrm{Spec}_M > \mathrm{Spec}_T.$$

If in turn the classifier has the best performance on the mislabeled genes ($\frac{b}{B+b} > \frac{a}{A+a}$, Fig. A.1d), we would expect

$$\mathrm{Sens}_M < \mathrm{Sens}_T \quad\text{and}\quad \mathrm{Spec}_M < \mathrm{Spec}_T.$$
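The relations above are easy to check numerically. Below is a small sketch (illustrative counts, ours rather than the paper's) that computes the truth-based and mislabel-based estimates from the six counts, using the same conventions as above:

```python
from fractions import Fraction as F

def measures(A, a, B, b, C, c):
    """Truth-based vs mislabel-based sensitivity/specificity.

    a, b, c: counts classified POSITIVE among correctly labeled positives,
             mislabeled positives, and true negatives, respectively.
    A, B, C: the corresponding counts classified NEGATIVE.
    """
    sens_t = F(a + b, (A + a) + (B + b))  # all true positives counted
    sens_m = F(a, A + a)                  # only labeled positives counted
    spec_t = F(C, C + c)
    spec_m = F(B + C, (B + b) + (C + c))  # mislabeled positives treated as negatives
    return sens_t, sens_m, spec_t, spec_m

# Random mislabeling: detection rate is 80% in both positive groups.
print(measures(A=20, a=80, B=10, b=40, C=90, c=10))
# -> sens_m == sens_t == 4/5, while spec_m = 2/3 < spec_t = 9/10.

# Worst performance on mislabeled genes (b/(B+b) < c/(C+c), Fig. A.1c):
print(measures(A=20, a=80, B=46, b=4, C=90, c=10))
# -> sensitivity and specificity are both overestimated.
```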

B Mis-ordered Classifiers in ROC Analysis
We illustrate the biased evaluation results for data sets in which a subset of genes is preferentially labeled. We simulate a gene set consisting only of protein-acting (P) genes and expression-acting (E) genes, and assume that genetic variants within P genes and E genes have different effect sizes due to varying degrees of penetrance. Our goal is to demonstrate that if the gold-standard (GS) set contains mostly one category of genes (e.g., mostly protein-acting genes), a classifier that also preferentially selects protein-acting genes can appear superior when evaluated through ROC analysis. In the generative model, each gene $i \in [1, n]$ has a binary "causal" status $Y_{ti} \in \{0, 1\}$ that depends on a set of "protein-acting features" (e.g., SNPs within a protein-coding gene) and a set of "expression-acting features" (e.g., eQTLs) through their corresponding effect sizes $\beta$. In particular, we simulate two cases: one in which the P-features have the larger effect sizes and one in which the E-features do. Two types of classifiers are trained on the simulated data: a P-classifier, which uses only the P-features, and an E-classifier, which uses only the E-features. We then simulate labels that are biased towards one of the two classes of features: P-feature-biased labels $Y_{bi}$ are generated from a model that depends only on the P-features, and E-feature-biased labels from a model that depends only on the E-features. Lastly, we multiply the simulated true labels $Y_{ti}$ and the simulated biased labels $Y_{bi}$, so that the final observed positives $Y_i = Y_{ti} \cdot Y_{bi}$ are a subset of the true positives. The classifiers are subsequently evaluated separately using the P-feature-biased and E-feature-biased labels (Fig. B.2).
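A minimal runnable sketch of this simulation design follows (our illustration; we assume logistic models for both the true causal status and the biased labeling, and we train on the true labels, since the paper's exact settings are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, p = 5000, 5                 # number of genes; features per class (illustrative)
X_P = rng.normal(size=(n, p))  # "protein-acting" features (e.g., coding SNPs)
X_E = rng.normal(size=(n, p))  # "expression-acting" features (e.g., eQTLs)

beta_P, beta_E = 1.5, 0.5      # case shown: P-features have the larger effects

# True causal status depends on both feature classes.
y_true = rng.random(n) < sigmoid(
    beta_P * X_P.sum(axis=1) + beta_E * X_E.sum(axis=1) - 2.0)

# P-feature-biased labeling: a causal gene is labeled positive only if its
# P-features support it, so observed positives are a subset of true positives.
y_bias = rng.random(n) < sigmoid(beta_P * X_P.sum(axis=1) - 1.0)
y_obs = y_true & y_bias

# One classifier per feature class, trained here on the true labels.
p_clf = LogisticRegression().fit(X_P, y_true)
e_clf = LogisticRegression().fit(X_E, y_true)

for name, clf, X in [("P-classifier", p_clf, X_P), ("E-classifier", e_clf, X_E)]:
    score = clf.predict_proba(X)[:, 1]
    print(f"{name}: AUC vs true labels = {roc_auc_score(y_true, score):.3f}, "
          f"AUC vs biased labels = {roc_auc_score(y_obs, score):.3f}")
```

Qualitatively, evaluation against the P-biased labels penalizes the E-classifier, because the expression-acting causal genes it correctly ranks highly are counted as false positives, while the same evaluation flatters the P-classifier.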
We compare the classifiers evaluated against biased labels with the same classifiers evaluated against true labels. In particular, we demonstrate that, depending on the effect sizes of the different features, one can reach opposite conclusions about classifier performance. In Figure B.3, when labels are biased towards the feature class with the larger effect size in the underlying data, the ROC curve measured against the biased labels is close to the curve measured against the true labels, and leads to the same conclusion regarding the relative performance of the two classifiers (Fig. B.3a, B.3d). However, when the labels are biased towards the feature class with the smaller effect size, the observed ranking of the classifiers is the opposite of the underlying truth (Fig. B.3b, B.3c).